feat: update e2e RAG eval #292
Conversation
Summary of Changes

Hello @e06084, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, more realistic end-to-end RAG evaluation pipeline focused on the FiQA dataset. By loading data directly from HuggingFace and running multi-dimensional evaluation with the Dingo framework, it provides a standardized benchmarking approach that replaces the previous evaluation based on local mock data, significantly improving the accuracy and reproducibility of RAG system evaluation.
Code Review
This pull request updates the end-to-end RAG evaluation examples. It removes an old example script and adds a more comprehensive one (e2e_RAG_eval_with_mockRAG_fiqa.py) that performs a full RAG evaluation on the FiQA dataset. The documentation is also updated to reflect these changes. The new example script is well-structured and provides a solid end-to-end workflow. My review focuses on improving its maintainability and robustness. I've suggested making default parameters more consistent, handling file paths more robustly, cleaning up return values, and reducing code duplication. These changes should make the example script easier to understand, use, and maintain.
| """简单的 RAG 系统""" | ||
|
|
||
| def __init__(self, llm_client: AsyncOpenAI, retriever: FiQACorpusRetriever, | ||
| system_prompt: str = None, model: str = "gpt-3.5-turbo"): |
The default value for the model parameter is "gpt-3.5-turbo", which is inconsistent with the rest of the script that defaults to using "deepseek-chat" from the OPENAI_MODEL environment variable. This can be misleading for anyone using this class. To improve consistency and maintainability, I suggest making the default value consistent with the global configuration.
```diff
-                 system_prompt: str = None, model: str = "gpt-3.5-turbo"):
+                 system_prompt: str = None, model: str = "deepseek-chat"):
```
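Beyond hardcoding "deepseek-chat", the default could be derived from the same `OPENAI_MODEL` environment variable the rest of the script reads. A minimal sketch of that idea (the `SimpleRAG` class here is an illustrative stand-in, not the PR's actual class):

```python
import os

# Hypothetical: resolve the default model from the environment variable the
# rest of the script already uses, falling back to "deepseek-chat".
DEFAULT_MODEL = os.getenv("OPENAI_MODEL", "deepseek-chat")


class SimpleRAG:  # stand-in for the class under review
    def __init__(self, model: str = DEFAULT_MODEL):
        self.model = model
```

This keeps the class default and the global configuration in sync automatically, so a later change to the environment variable does not silently diverge from the constructor signature.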
```python
        return {
            "answer": answer,
            "retrieved_documents": docs,
            "context_list": [doc.page_content for doc in docs]
        }
```
The query method returns a retrieved_documents field which is a list of langchain_core.documents.Document objects. These objects are not JSON serializable, and this field is not used by the calling function generate_rag_responses (which uses context_list instead). To improve clarity and avoid returning unnecessary, non-serializable data, I suggest removing retrieved_documents from the return dictionary. A similar change should be applied to the return statement in the if not docs: block on lines 167-171.
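The serializability point can be demonstrated with a stand-in class (the `Document` below is a hypothetical mimic of `langchain_core.documents.Document`, used so the sketch runs without langchain installed):

```python
import json


class Document:  # stand-in for langchain_core.documents.Document
    def __init__(self, page_content):
        self.page_content = page_content


docs = [Document("FiQA passage 1"), Document("FiQA passage 2")]

# A list of arbitrary objects cannot be dumped to JSON...
try:
    json.dumps({"retrieved_documents": docs})
except TypeError as e:
    print("not serializable:", e)

# ...while a list of plain strings serializes cleanly.
print(json.dumps({"context_list": [d.page_content for d in docs]}))
```

Since `generate_rag_responses` only consumes `context_list`, dropping `retrieved_documents` loses nothing and keeps the return value directly dumpable to JSONL.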
```python
        return {
            "answer": answer,
            "context_list": [doc.page_content for doc in docs]
        }
```

```python
def print_metrics_summary(summary: SummaryModel):
    """打印指标统计摘要"""
    if not summary.metrics_score_stats:
        print("⚠️ 没有指标统计数据")
        return

    print("\n" + "=" * 80)
    print("📊 RAG 评估指标统计")
    print("=" * 80)

    for field_key, metrics in summary.metrics_score_stats.items():
        print(f"\n📁 字段组: {field_key}")
        print("-" * 80)

        for metric_name, stats in metrics.items():
            display_name = metric_name.replace("LLMRAG", "")
            print(f"\n {display_name}:")
            print(f" 平均分: {stats.get('score_average', 0):.2f}")
            print(f" 最小分: {stats.get('score_min', 0):.2f}")
            print(f" 最大分: {stats.get('score_max', 0):.2f}")
            print(f" 样本数: {stats.get('score_count', 0)}")
            if 'score_std_dev' in stats:
                print(f" 标准差: {stats.get('score_std_dev', 0):.2f}")

        overall_avg = summary.get_metrics_score_overall_average(field_key)
        print(f"\n 🎯 该字段组总平均分: {overall_avg:.2f}")

        metrics_summary = summary.get_metrics_score_summary(field_key)
        sorted_metrics = sorted(metrics_summary.items(), key=lambda x: x[1], reverse=True)

        print("\n 📈 指标排名(从高到低):")
        for i, (metric_name, avg_score) in enumerate(sorted_metrics, 1):
            display_name = metric_name.replace("LLMRAG", "")
            print(f" {i}. {display_name}: {avg_score:.2f}")

    print("\n" + "=" * 80)
```
The function print_metrics_summary seems to be a utility function that is also present in other example scripts like examples/rag/dataset_rag_eval_baseline.py. To adhere to the DRY (Don't Repeat Yourself) principle and improve maintainability, consider moving this function to a shared utility module (e.g., examples/rag/utils.py) and importing it where needed.
```python
    elif args.limit:
        output_filename = f"fiqa_end_to_end_rag_output_limit_{args.limit}.jsonl"

    output_path = "test/data/" + output_filename
```
Using + for path concatenation is not robust and can lead to issues on different operating systems. It's better to use os.path.join() to construct file paths. Additionally, the output directory "test/data/" is hardcoded, which makes the script less flexible. Consider making this a configurable parameter, for example, via a command-line argument.
```python
output_path = os.path.join("test/data", output_filename)
```
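Both points from this comment can be sketched together. Below, `--output-dir` is an assumed flag name (not part of the PR) showing how the hardcoded directory could become configurable while `os.path.join` handles the separator:

```python
import argparse
import os

# Hypothetical: expose the output directory as a CLI flag with the current
# hardcoded value as its default, then join paths portably.
parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", default="test/data")
args = parser.parse_args(["--output-dir", "results"])  # simulate CLI input

output_filename = "fiqa_end_to_end_rag_output.jsonl"
output_path = os.path.join(args.output_dir, output_filename)
```

With no flag passed, behavior is unchanged (`test/data/...`); with `--output-dir results`, the script writes elsewhere without editing code.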